Lab2 tutorial

Yue You

2021-08-02

Lab 2 Overview

First Half

More Visualisations:

  • ggplot2 (ggmap)

  • HexBins vs SquareBins

  • Choropleths

Descriptive Statistics:

  • Histograms and looking at the distribution shape.

  • How to “quick plot” dataframes using ggplot2.

A brief summary:

Second Half

Revision:

  • Any code related questions for R

Let’s get started

Load libraries

library(dplyr)
library(sf)
library(curl)
library(ggmap)
library(tmap)
library(tmaptools)
library(feather)

Load data

#try feather 
filepath = "../data/df.feather"
df <- read_feather(filepath)
df %>% tail
## # A tibble: 6 x 19
##   VendorID tpep_pickup_dateti… tpep_dropoff_datet… passenger_count trip_distance
##      <int> <fct>               <fct>                         <int>         <dbl>
## 1        2 4/12/15 22:55       4/12/15 23:28                     2         12.8 
## 2        2 4/12/15 22:55       4/12/15 23:03                     1          0.75
## 3        1 4/12/15 22:55       4/12/15 23:08                     1          2.4 
## 4        1 4/12/15 22:55       4/12/15 23:01                     1          0.8 
## 5        2 4/12/15 22:55       4/12/15 23:17                     1          4.73
## 6        2 4/12/15 22:55       4/12/15 22:59                     2          0.8 
## # … with 14 more variables: pickup_longitude <dbl>, pickup_latitude <dbl>,
## #   RatecodeID <int>, store_and_fwd_flag <fct>, dropoff_longitude <dbl>,
## #   dropoff_latitude <dbl>, payment_type <int>, fare_amount <dbl>, extra <dbl>,
## #   mta_tax <dbl>, tip_amount <dbl>, tolls_amount <dbl>,
## #   improvement_surcharge <dbl>, total_amount <dbl>

ggplot2

It’s hard to succinctly describe how ggplot2 works because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and aesthetic mapping (with aes()). You then add on layers (like geom_point() or geom_histogram()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()).

library(ggplot2)
ggplot(df, aes(total_amount, trip_distance, colour = as.factor(payment_type))) + 
  geom_point()

In spatial statistics the ability to visualize data and models superimposed with their basic social landmarks and geographic context is invaluable. ggmap is a new tool which enables visualization by combining the spatial information of static maps from Google Maps, OpenStreetMap, etc., with the layered grammar of graphics implementation of ggplot2.

Select the map.

xranges <- range(df$pickup_longitude[!df$pickup_longitude==0])
yranges <- range(df$pickup_latitude[!df$pickup_latitude==0])
xranges
## [1] -77.04710 -71.06483
yranges
## [1] 37.27044 42.73614
map_big <- get_stamenmap(
  rbind(-74.3,40.45,-73.6,40.95), 
  zoom = 10)
ggmap(map_big)

ggmap(map_big) + 
  geom_point(data = df,
             aes(x = dropoff_longitude,
                 y = dropoff_latitude),
             colour="red", size = 0.05) +
  geom_point(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             colour="blue", size = 0.05) 

Heatmaps

  • Excellent tool when visualising geospatial coordinates.

Plot pickup locations.

ggmap(map_big) + 
  geom_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) 

Change the color scale.

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) +
  scale_fill_viridis_c(trans = "log10")

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude),
             bins=50) +
  scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")

ggmap(map_big) +
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 alpha = 0.95),
             bins=50) +
  scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")

Filled by discrete factors.

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(payment_type)),
             bins=50) 

Change the color palette. A good palette makes your story better.

library(RColorBrewer)
display.brewer.all()

ggmap(map_big) + 
  stat_bin2d(data = df,
             aes(x = pickup_longitude,
                 y = pickup_latitude,
                 fill = as.factor(payment_type)),
             bins=100) +
  scale_fill_brewer(palette = "Dark2")

ggmap(map_big) + 
  stat_bin2d(data=df,
           aes(x = pickup_longitude, y = pickup_latitude),
           alpha = 0.9,
           bins=100) +
  scale_fill_viridis_c(trans = "log10") +
  facet_grid(.~payment_type)

Hexagon bins avoid the visual artefacts sometimes generated by the very regular alignment of geom_bin2d().

Plot HexBin.

ggmap(map_big) + 
  stat_binhex(data=df,
           aes(x = pickup_longitude, y = pickup_latitude),
               bins=100) +
  coord_cartesian()  +
  scale_fill_gradient2(trans="log10")

Shapefile Overlays

  • NOTE: This only applies on the more recent datasets that use zones over coordinates

Requirements:

  • ggplot2/ ggmap/ tmap/ leaflet
df_yellow = read.csv("../data/yellow_tripdata_2019-01.csv",stringsAsFactors = TRUE)
df_yellow %>% 
  group_by(PULocationID, payment_type) %>% 
  summarise(n = n()) -> df_yellow_summary
## `summarise()` has grouped output by 'PULocationID'. You can override using the `.groups` argument.
library(sf)

Shapefile Links:

A shapefile is a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas).

taxi_zones.zip taxi+_zone_lookup.csv

Basic information in shapefile.

sf = st_read("../data/shapefile/taxi_zones/")
## Reading layer `taxi_zones' from data source 
##   `/mnt/unimelb.edu.au/yyyou1/MAST30034_R/data/shapefile/taxi_zones' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
plot(sf)

Choropleth mapping with ggplot2

ggplot() + 
  geom_sf(data = sf, size = 1, color = "black", fill = "cyan1") + 
  ggtitle("Boundary Plot") + 
  coord_sf()+
  theme_bw()

Adding basemaps with ggmap

ggmap(map_big) +
    geom_sf(data = sf, aes(fill=Shape_Area), inherit.aes = FALSE)  +
  coord_sf(crs = st_crs(4326)) +
  scale_fill_viridis_c()

Choropleth with tmap

tmap is specifically designed to make creation of thematic maps more convenient. It borrows from thw ggplot syntax and takes care of a lot of the styling and aesthetics. This reduces our amount of code significantly. We only need:

  • tm_shape() where we provide the sf object (we could also provide an SpatialPolygonsDataframe)

  • tm_polygons() where we set the attribute variable to map, the break style, and a title.

tm_shape(sf)+
   tm_polygons("Shape_Area",style="quantile")

Incorporate information we have based on pickup locations to the simple feature collection.

df_yellow_summary[!duplicated(df_yellow_summary$PULocationID),]-> df_yellow_summary
sf_join <- 
left_join(sf,df_yellow_summary,
          by=c("LocationID"="PULocationID")) 
tm_shape(sf_join)+
   tm_polygons("payment_type",style="cat")

tmap_mode("view")
## tmap mode set to interactive viewing
tmap_last()

Read more about Reprojecting geographic data.

Descriptive Statistics

filepath = "../data/df.feather"
df <- read_feather(filepath)
summary(df)
##     VendorID        tpep_pickup_datetime   tpep_dropoff_datetime
##  Min.   :1.000   4/12/15 19:14:  508     4/12/15 21:00:  487    
##  1st Qu.:1.000   4/12/15 22:44:  506     4/12/15 21:40:  486    
##  Median :2.000   4/12/15 21:57:  503     4/12/15 20:14:  484    
##  Mean   :1.528   4/12/15 22:46:  491     4/12/15 20:58:  480    
##  3rd Qu.:2.000   4/12/15 19:39:  489     4/12/15 21:01:  480    
##  Max.   :2.000   4/12/15 19:28:  486     4/12/15 20:31:  476    
##                  (Other)      :97017     (Other)      :97107    
##  passenger_count trip_distance    pickup_longitude pickup_latitude
##  Min.   :0.000   Min.   : 0.000   Min.   :-77.05   Min.   : 0.00  
##  1st Qu.:1.000   1st Qu.: 1.030   1st Qu.:-73.99   1st Qu.:40.73  
##  Median :1.000   Median : 1.710   Median :-73.98   Median :40.75  
##  Mean   :1.704   Mean   : 2.827   Mean   :-72.88   Mean   :40.14  
##  3rd Qu.:2.000   3rd Qu.: 3.100   3rd Qu.:-73.97   3rd Qu.:40.77  
##  Max.   :6.000   Max.   :91.200   Max.   :  0.00   Max.   :42.74  
##                                                                   
##    RatecodeID     store_and_fwd_flag dropoff_longitude dropoff_latitude
##  Min.   : 1.000   N:99421            Min.   :-74.62    Min.   : 0.00   
##  1st Qu.: 1.000   Y:  579            1st Qu.:-73.99    1st Qu.:40.73   
##  Median : 1.000                      Median :-73.98    Median :40.75   
##  Mean   : 1.029                      Mean   :-72.93    Mean   :40.17   
##  3rd Qu.: 1.000                      3rd Qu.:-73.96    3rd Qu.:40.77   
##  Max.   :99.000                      Max.   :  0.00    Max.   :41.49   
##                                                                        
##   payment_type   fare_amount          extra            mta_tax       
##  Min.   :1.00   Min.   :-120.00   Min.   :-1.0000   Min.   :-0.5000  
##  1st Qu.:1.00   1st Qu.:   7.00   1st Qu.: 0.5000   1st Qu.: 0.5000  
##  Median :1.00   Median :  10.00   Median : 0.5000   Median : 0.5000  
##  Mean   :1.34   Mean   :  12.75   Mean   : 0.6049   Mean   : 0.4985  
##  3rd Qu.:2.00   3rd Qu.:  15.00   3rd Qu.: 0.5000   3rd Qu.: 0.5000  
##  Max.   :4.00   Max.   : 500.00   Max.   : 1.5000   Max.   : 2.5000  
##                                                                      
##    tip_amount      tolls_amount     improvement_surcharge  total_amount    
##  Min.   :  0.00   Min.   : 0.0000   Min.   :-0.3000       Min.   :-120.30  
##  1st Qu.:  0.00   1st Qu.: 0.0000   1st Qu.: 0.3000       1st Qu.:   9.30  
##  Median :  1.46   Median : 0.0000   Median : 0.3000       Median :  12.80  
##  Mean   :  1.80   Mean   : 0.2386   Mean   : 0.2997       Mean   :  16.19  
##  3rd Qu.:  2.55   3rd Qu.: 0.0000   3rd Qu.: 0.3000       3rd Qu.:  18.36  
##  Max.   :115.00   Max.   :24.0000   Max.   : 0.3000       Max.   : 550.30  
## 

Remember that not all the data should be interpreted as purely numerical. There may be conclusions you can draw by coincidence if you incorrectly assume data types! For example:

  • longitude and latitude should be interpreted as geospatial coordinates

  • payment_type is a discrete category of payment types

  • trip_distance is non-linear (not a straight line from A to B), but we have no further data on it. It is also in miles.

  • To avoid misinterpreting the attributes, refer to the data dictionary provided on the TLC website.

Scatterplot plots for Fare / Distance

ggplot(df) +
  aes(x=fare_amount,y=trip_distance) +
  geom_point()+
  theme_bw()

General Inference:

  • We can visually see that the relationship is relatively linear as you’d expect (more distance generally means more money)
  • A fair number of outliers, notably around the 0 distance axis and 0 cost axis. Why might this be the case?
  • Why are there negative values?

Histgram plots for Fare amount

ggplot(df)+
  geom_histogram(aes(x=fare_amount),bins =30, fill="darkblue") +
  theme_bw()

  • I guess we can kind of see most fares are between 0 - 100.
  • Hard to tell where the main distribution is spread around.
  • We can enhance this using a log transformation.
df$transform_fare_amount <- log(df$fare_amount)
ggplot(df)+
  geom_histogram(aes(x=transform_fare_amount),bins = 30, fill="darkblue") +
  theme_bw()

  • Take a log transformation to visually see the distribution

Do a correlation check.

cor(df$trip_distance,df$fare_amount,method = "pearson")
## [1] 0.899978
library(pheatmap)
#remove characters
pheatmap(cor(df[,-c(1:3,9)]),
         cluster_rows = FALSE,
         cluster_cols = FALSE)

  • trip_distance highly correlates with high tips, tolls and overall trip amount
  • payment_type seems to have some form of negative correlation with tip_amount. Must be careful as this is a discrete category.
cors =c("passenger_count", "trip_distance", "fare_amount", "extra", 
 "mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount")
cor(df[,cors])
##                       passenger_count trip_distance   fare_amount        extra
## passenger_count          1.0000000000   0.001658362  0.0009505592 -0.003093358
## trip_distance            0.0016583622   1.000000000  0.8999780236 -0.223323624
## fare_amount              0.0009505592   0.899978024  1.0000000000 -0.207022734
## extra                   -0.0030933583  -0.223323624 -0.2070227338  1.000000000
## mta_tax                  0.0131316387  -0.027817098 -0.1282543385  0.130305153
## tip_amount              -0.0111149872   0.530623359  0.5573861481 -0.100025231
## tolls_amount             0.0065090920   0.572059235  0.5307124782 -0.157824284
## improvement_surcharge    0.0040463644   0.015943787  0.0489414626  0.095486378
## total_amount            -0.0007104667   0.898539215  0.9824415224 -0.186521180
##                           mta_tax  tip_amount tolls_amount
## passenger_count        0.01313164 -0.01111499  0.006509092
## trip_distance         -0.02781710  0.53062336  0.572059235
## fare_amount           -0.12825434  0.55738615  0.530712478
## extra                  0.13030515 -0.10002523 -0.157824284
## mta_tax                1.00000000 -0.06226707 -0.094005581
## tip_amount            -0.06226707  1.00000000  0.359383270
## tolls_amount          -0.09400558  0.35938327  1.000000000
## improvement_surcharge  0.59152689  0.01722265  0.004483847
## total_amount          -0.12168476  0.68568923  0.603590674
##                       improvement_surcharge  total_amount
## passenger_count                 0.004046364 -0.0007104667
## trip_distance                   0.015943787  0.8985392152
## fare_amount                     0.048941463  0.9824415224
## extra                           0.095486378 -0.1865211796
## mta_tax                         0.591526892 -0.1216847619
## tip_amount                      0.017222652  0.6856892276
## tolls_amount                    0.004483847  0.6035906745
## improvement_surcharge           1.000000000  0.0486015276
## total_amount                    0.048601528  1.0000000000
pheatmap(cor(df[,cors]))

Other plots to present correlations.

library(corrplot)
## corrplot 0.90 loaded
corrplot(cor(df[,cors]), method = 'number')

corrplot(cor(df[,cors]), method = 'color', order = 'alphabet') 

‘AOE’ is for the angular order of the eigenvectors.

corrplot(cor(df[,cors]), method = 'color', order = 'AOE')  # after 'AOE' reorder

More references to draw cor plots.

Study Payment type and Distance

library(forcats)
df = df %>% 
  mutate(payment_type = as_factor(payment_type)) %>% 
  mutate(payment_type = recode(payment_type,`1` = "Credit Card", `2` = "Cash", `3` = "No charge", `4` = "Dispute"))
# Check category frequencies
fct_count(df$payment_type)
## # A tibble: 4 x 2
##   f               n
##   <fct>       <int>
## 1 Credit Card 66601
## 2 Cash        32955
## 3 No charge     336
## 4 Dispute       108
#Collapse Cash/No charge/Dispute
df = df %>% 
  mutate(payment_type = fct_collapse(payment_type, 
                                     Credit_Card = c("Credit Card"), 
                                     Cash_Or_Others=c("Cash","No charge","Dispute")))
ggplot(df, aes(payment_type)) + 
  geom_bar(fill=c("pink","orange")) + 
  ylab("Frequency") + 
  ggtitle("Barplot of Payment Methods")+ 
  theme_bw()

ggplot(df) +
  geom_histogram(aes(x=trip_distance, fill=payment_type))+ 
  theme_bw()

ggplot(df) +
  geom_histogram(aes(x=trip_distance, fill=payment_type),
                 breaks=seq(0,20, by=1))+ 
  theme_bw()+
  scale_fill_brewer(palette = "Paired")

Use facets.

ggplot(df) +
  geom_histogram(aes(x=trip_distance, fill=payment_type),
                 breaks=seq(0,20, by=1))+ 
  theme_bw()+
  facet_grid(.~payment_type) +
  scale_fill_brewer(palette = "Paired")

The default setting of position to draw two histgrams together is by stacking them instead of their identity.

Plot the information of payment type and trip distance together in one.

ggplot(df) +
  geom_boxplot(aes(x=payment_type,
                  y=trip_distance,
                  fill=payment_type))+ 
  scale_y_continuous(trans='log2') +
  scale_fill_brewer(palette = "Paired") +
  theme_bw()

Super useful ggplot2 cheat sheet!!

A brief summary

Visualisations

  • Good visualizations vs Bad visualizastions

Personal Checklist for Visualisations and Dashboards:

  • Your visualisation needs to tell a story.

  • It should be intepretable without being overly verbose.

  • The scale and axis need to make sense (and you can assume the reader knows the difference between a normal scale vs log scale).

  • The choice of visualisation needs to make sense:

  1. Line plot vs Bar chart with non-numerical categories
  2. Scatterplot vs Histogram plot to see distribution
  3. etc
  • Choice of colour scheme / alpha / size need to be easy on the eyes.

At the end of the day, even if you think your visualisation is “pretty” or “beautiful”, if a reader cannot understand it, then it is not a good visualisation.

Use Spark in R

See sparklyr tutorial.

SessionInfo

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
##  [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
##  [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_0.5.1      corrplot_0.90      pheatmap_1.0.12    RColorBrewer_1.1-2
##  [5] feather_0.3.5      tmaptools_3.1-1    tmap_3.3-2         ggmap_3.0.0       
##  [9] ggplot2_3.3.5      curl_4.3.2         sf_1.0-1           dplyr_1.0.7       
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7            httr_1.4.2              tools_4.0.3            
##  [4] bslib_0.2.5.1           utf8_1.2.1              R6_2.5.0               
##  [7] KernSmooth_2.23-17      DBI_1.1.1               colorspace_2.0-2       
## [10] raster_3.4-13           withr_2.4.2             sp_1.4-5               
## [13] tidyselect_1.1.1        leaflet_2.0.4.1         compiler_4.0.3         
## [16] leafem_0.1.6            cli_3.0.0               labeling_0.4.2         
## [19] bookdown_0.22           sass_0.4.0              scales_1.1.1           
## [22] classInt_0.4-3          hexbin_1.28.2           proxy_0.4-26           
## [25] stringr_1.4.0           digest_0.6.27           rmarkdown_2.9          
## [28] base64enc_0.1-3         dichromat_2.0-0         jpeg_0.1-8.1           
## [31] pkgconfig_2.0.3         htmltools_0.5.1.1       highr_0.9              
## [34] htmlwidgets_1.5.3       rlang_0.4.11            rstudioapi_0.13        
## [37] jquerylib_0.1.4         generics_0.1.0          farver_2.1.0           
## [40] jsonlite_1.7.2          crosstalk_1.1.1         magrittr_2.0.1         
## [43] s2_1.0.6                Rcpp_1.0.7              munsell_0.5.0          
## [46] fansi_0.5.0             abind_1.4-5             lifecycle_1.0.0        
## [49] stringi_1.7.2           leafsync_0.1.0          yaml_2.2.1             
## [52] plyr_1.8.6              grid_4.0.3              parallel_4.0.3         
## [55] crayon_1.4.1            lattice_0.20-41         stars_0.5-3            
## [58] hms_1.1.0               knitr_1.33              pillar_1.6.1           
## [61] rjson_0.2.20            codetools_0.2-16        wk_0.5.0               
## [64] XML_3.99-0.6            glue_1.4.2              evaluate_0.14          
## [67] leaflet.providers_1.9.0 png_0.1-7               vctrs_0.3.8            
## [70] rmdformats_1.0.2        RgoogleMaps_1.4.5.3     gtable_0.3.0           
## [73] purrr_0.3.4             tidyr_1.1.3             assertthat_0.2.1       
## [76] xfun_0.24               lwgeom_0.2-6            e1071_1.7-7            
## [79] class_7.3-17            viridisLite_0.4.0       tibble_3.1.2           
## [82] units_0.7-2             ellipsis_0.3.2